-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix more Unicode bugs #24182
base: master
Are you sure you want to change the base?
Fix more Unicode bugs #24182
Conversation
d539093
to
b378d1f
Compare
b378d1f
to
db728d4
Compare
@fmeum Mind rebasing before I review? |
db728d4
to
f47f328
Compare
@tjgq Rebased! Let me know if you would prefer me to split this up further to simplify the review. |
@bazel-io fork 8.0.0 |
No-op. This parameter is deprecated and will be removed in a future version of \ | ||
Bazel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's make sure we add this to the release notes when we cherry-pick it.
(Ordinarily, I would ask you to add a RELNOTES[INC]
to the PR description, but I suspect that's counter-productive: we don't have a process to pick up release notes from CLs submitted after the branch is cut, so it would probably end up in the release notes for 9.0.0 instead.)
@@ -182,7 +182,7 @@ Builder setQuiet(boolean quiet) { | |||
|
|||
private static String toString(ByteArrayOutputStream stream) { | |||
try { | |||
return new String(stream.toByteArray(), UTF_8); | |||
return stream.toString(UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this one not ISO_8859_1, matching the other branches in execute
?
String tpl = FileSystemUtils.readContent(t.getPath(), StandardCharsets.UTF_8); | ||
// Read and write files as raw bytes by using the Latin-1 encoding, which matches the encoding | ||
// used by Bazel for strings. | ||
String tpl = FileSystemUtils.readContent(t.getPath(), StandardCharsets.ISO_8859_1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Static import (also below)
public static String encode(String username, String password) { | ||
return "Basic " | ||
+ Base64.getEncoder().encodeToString((username + ":" + password).getBytes(UTF_8)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be better for NetrcParser
to decode the username/password into Latin1 and use Latin1 here? This would avoid creating an exception to the rule "strings in the Bazel heap are raw bytes".
(Alternatively, add a comment noting the exception.)
@@ -138,8 +138,7 @@ private void createVendorFile(Path vendorPath, Path vendorFilePath) | |||
throws VendorFileFunctionException { | |||
try { | |||
vendorPath.createDirectoryAndParents(); | |||
byte[] vendorFileContents = VENDOR_FILE_HEADER.getBytes(UTF_8); | |||
FileSystemUtils.writeContent(vendorFilePath, vendorFileContents); | |||
FileSystemUtils.writeContent(vendorFilePath, StandardCharsets.UTF_8, VENDOR_FILE_HEADER); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is absolutely correct, but is an interesting exception to the "strings are raw bytes" rule (literals in source code). Let's also add a comment.
Also, static import.
@@ -171,7 +171,7 @@ public FilterShowIncludesOutputStream(OutputStream out, String sourceFileName) { | |||
public void write(int b) throws IOException { | |||
buffer.write(b); | |||
if (b == NEWLINE) { | |||
String line = buffer.toString(StandardCharsets.UTF_8.name()); | |||
String line = buffer.toString(StandardCharsets.UTF_8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this an exception to the "strings are raw bytes" rule?
Specifically, won't these strings later be compared against filenames coming from BUILD/bzl files, which are raw bytes?
setup_starlark_repository | ||
|
||
if "$is_windows"; then | ||
# äöüÄÖÜß in UTF8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "UTF-8" (also below)
@@ -242,6 +250,43 @@ function test_dependency_pruning_scenario() { | |||
check_unused_content "pkg/c.input" | |||
} | |||
|
|||
function test_dependency_pruning_scenario_unicode() { | |||
if "$is_windows"; then | |||
# äöüÄÖÜß in UTF8 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: "UTF-8" (also below)
unused_input_list
handling of non-ASCII characters in file names.legacy_utf8
parameter ofrepository_ctx.file
toFalse
and make it a no-op. With the previous default, any non-ASCII characters would be written out as double encoded UTF-8, which is not a useful choice.repository_ctx.template
to operate on raw bytes for consistency withrepository_ctx.read
and to fix substitution with non-ASCII keys/values.UTF_8
closer to their usage site to clarify why they are correct.